Authors: Chung, Nguyen, Pillay and Wang
Southern Methodist University
The Kaggle dataset (https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset) file players_20.csv is used in this lab. The dataset provides detailed statistics for the soccer players in the clubs of the major soccer leagues in the world. The data originally comes from the FIFA soccer game created by EA Sports, which estimates the abilities of the real players and builds the game from those estimates. The Kaggle dataset is scraped from www.sofifa.com, where the gaming data is collected. The data we use was last updated on Sept 19th, 2019.
The dataset is valuable because the player abilities are estimated from the real players. FIFA is a very popular game and EA Sports is one of the largest sports video game developers, so the ability estimates are quite accurate. Therefore, useful knowledge can be mined from the data for a variety of problems in the soccer industry: for example, wage analysis, player analysis, training strategy, budget analysis, and sports gambling strategy can all be studied with this dataset. Soccer is a big industry whose market size was estimated to be worth $488 billion in 2018, according to Business Wire.
Some of the analyses we are interested in performing are:
Run a predictive model to predict wage from player abilities. We will select features and run a regression model, and we will also try PCA followed by regression. To measure effectiveness, we will use RMSE, MAE, and R squared.
Run a classification model to classify player positions from player abilities. To validate the model, we will use accuracy, precision, F1 score, and ROC.
We also intend to use the detailed game statistics from fbref.com. We can potentially run analyses using the win/loss results from real match data.
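As a preview of those validation steps, the metrics named above can all be computed with scikit-learn. The snippet below is a minimal sketch on toy predictions, not our actual models:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                             accuracy_score, precision_score, f1_score)

# Toy regression output (stand-in for wage predictions)
y_true = np.array([1000.0, 2000.0, 3000.0])
y_pred = np.array([1100.0, 1900.0, 3300.0])
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Toy classification output (stand-in for position predictions)
c_true = ['GK', 'DEF', 'FWD', 'DEF']
c_pred = ['GK', 'DEF', 'FWD', 'FWD']
acc = accuracy_score(c_true, c_pred)          # 0.75 on this toy data
prec = precision_score(c_true, c_pred, average='macro')
f1 = f1_score(c_true, c_pred, average='macro')
print(rmse, mae, r2, acc, prec, f1)
```

For multi-class position labels, `average='macro'` averages the per-class scores; ROC curves additionally need predicted probabilities rather than hard labels.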
Image captured from https://en.wikipedia.org/wiki/FIFA_20
#All Python module imports
#https://pandas.pydata.org/docs/user_guide/index.html#user-guide
import pandas as pd #Pandas Dataframe module
import numpy as np
from math import pi
#scikit learn
#https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
import sklearn as sl
import pycountry
import plotly.express as px
#https://seaborn.pydata.org
import seaborn as sns
import matplotlib.pyplot as plt
# os calls
import os
#Module for formatting tables for documentation
#https://pypi.org/project/tabulate/
from tabulate import tabulate
from IPython.display import display, Markdown
#Interactive mode
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import Image
Due to the large number of features, we created a CSV file that serves as a metadata file, as follows:
1) File location: ../data/fifa20/data_desc.csv
2) The Name column has the exact feature name from our original dataset file, and we use it to load the file.
3) Description: the team filled in a description of each feature.
4) Variable_cat: we use this to categorize related features; for example, 'Position' indicates all the features related to the various position attributes.
Using metadata makes it easy to ignore attributes that are not relevant to our analysis by marking them as Ignore.
#Read data explanation csv file maintained by project team
%time df_desc = pd.read_csv('../data/fifa20/data_desc.csv',usecols=['Name','Description','Statistics','Variable_cat'])
display(Markdown(df_desc[['Name','Description','Variable_cat']].to_markdown()))
#All positional attributes. These are the various field positions a player can play in, with the player's
#statistics in that position. These attributes only apply to non-goalkeepers.
df_position_attr = df_desc[df_desc['Variable_cat']=='Position']['Name']
#All goalKeeper only attributes
df_gk_attr = df_desc[df_desc['Variable_cat']=='GK']['Name']
#All Technical attributes. Non GK players primary technical skills such as dribbling, curving etc.
df_technical_attr = df_desc[df_desc['Variable_cat']=='Technical']['Name']
#All mental attributes for all players
df_mental_attr = df_desc[df_desc['Variable_cat']=='Mental']['Name']
#All Attacking attributes for all players
df_attacking_attr = df_desc[df_desc['Variable_cat']=='Attacking']['Name']
#All general Skill category attributes for all players
df_skill_attr = df_desc[df_desc['Variable_cat']=='Skill']['Name']
#All Movement attributes for all players
df_movement_attr = df_desc[df_desc['Variable_cat']=='Movement']['Name']
#All Power attributes for all players
df_power_attr = df_desc[df_desc['Variable_cat']=='Power']['Name']
#All Defending attributes for all players
df_defending_attr = df_desc[df_desc['Variable_cat']=='Defending']['Name']
# Sample code for future use to include ad hoc columns
# Get all columns for player position; the same works for Technical, Physical, GK, etc.
## l=df_gk_attr.append(pd.Series(['short_name','age','club']))
## df_gkplus_list from main object df_players = df_players[l]
This image helps us understand the various positions that a soccer player is assigned during a game. The feature player_positions lists the multiple positions a player can play in as a comma-separated value; the very first item in that list is the player's preferred field position. The team_position is the position the player currently fills in the team. For our analysis we will use the player's preferred position.
The statistics of each individual player's ability in the respective positions are stored in separate features on a scale of 1 to 100, as described in the table output above.
Under normal circumstances a goalkeeper does not occupy any of the positions that a regular player occupies, which means he will not have statistics for those features. A goalkeeper instead has a different set of features to capture his statistics; these are displayed in the table above with the Variable_cat field set to 'GK'.
#Data mining: read the CSV file
#Using data set: https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset#players_15.csv
#we will read only the columns we are interested in
cols_to_read = df_desc['Name'].loc[df_desc.Variable_cat != 'Ignore']
%time df_players = pd.read_csv('../data/fifa20/players_20.csv',usecols=cols_to_read)
df_players.shape
df_players.describe()
df_players.info(verbose=True, null_counts=True)
Data types: float64(16), int64(45), object(43)
There is no spacing in the column names, so we do not need to rename columns.
We will fix the remaining issues and the data type conversions (continuous, discrete, ordinal, nominal) in a later section after doing some further analysis.
df_players['player_positions'].value_counts()
df_players['nation_position'].value_counts()
df_players['team_position'].value_counts()
Note: There are 240 players with 0 wage_eur and 250 with 0 value_eur; we will fill these with median values for our predictions.
df_players['wage_eur'].value_counts(bins=10)
df_players['wage_eur'].loc[df_players.wage_eur <=0]
df_players['value_eur'].value_counts(bins=10)
df_players['value_eur'].loc[df_players.value_eur <=0]
df_players['preferred_foot'].value_counts()
Note: It looks like seven players' body_type was not categorized properly. During the data cleanup stage we will set them to 'Normal' in case we decide to use this feature for prediction.
df_players['body_type'].value_counts()
df_players['club'].value_counts()
As mentioned above, player_positions lists the multiple positions a player can play in as a comma-separated value, with the first item being the player's preferred field position. We will now extract the first value into preferred_position.
df_players['preferred_position'] = df_players.player_positions.str.split(',').apply(lambda x: x[0])
We see that team_position has 241 empty values. Since we do not intend to use team_position or nation_position in our analysis, we will drop both.
df_players = df_players.drop(columns=['nation_position','team_position'])
df_players.shape
It is critical to properly identify goalkeepers vs the rest of the players, as many attributes are specific to either goalkeepers or regular players. We will use this distinction a lot in our data cleanup and analysis later.
#goal keeper count
df_players[df_players['preferred_position'] =='GK'].shape
#regular player count
df_players[df_players['preferred_position'] !='GK'].shape
We will use this later to predict player positions.
df_players.preferred_position.value_counts()
df_players['preferred_position_cat'] = df_players.preferred_position.map({
'CB': 'DEF',
'ST': 'FWD',
'CM': 'MID',
'GK': 'GK',
'CDM': 'DEF',
'RB': 'DEF',
'LB': 'DEF',
'CAM': 'FWD',
'RM': 'MID',
'LM': 'MID',
'LW': 'FWD',
'RW': 'FWD',
'CF': 'FWD',
'LWB': 'DEF',
'RWB': 'DEF'
})
df_players.preferred_position_cat.value_counts()
As seen in the analysis in the section above, there are 240 players with 0 wage_eur and 250 with 0 value_eur; we will fill these with median values for our predictions.
# 1. wages: the fields wage_eur and value_eur have about 240 and 250 values respectively set to 0 (analyzed in
# section B); we will set these values to the respective median values
df_players['wage_eur'].median()
df_players['wage_eur'].replace(0,df_players['wage_eur'].median(), inplace=True)
df_players['value_eur'].median()
df_players['value_eur'].replace(0,df_players['value_eur'].median(), inplace=True)
We identified seven players whose body_type was not categorized properly, so we will set them to 'Normal' in case we decide to use this feature for prediction.
df_players['body_type'].replace({'Shaqiri': 'Normal', 'Akinfenwa': 'Normal'
, 'C. Ronaldo': 'Normal', 'PLAYER_BODY_TYPE_25': 'Normal'
, 'Neymar': 'Normal', 'Courtois': 'Normal','Messi':'Normal'}, inplace=True)
df_players['body_type'].value_counts()
The position features include a year-over-year improvement/decrement. For example, a player in the st (Striker) position with a value of 89+2 is basically 89 with an increase (- would indicate a decrease) of 2 from last year. Since we will not be doing any year-over-year analysis, we will remove the suffix for these columns = ['ls','st','rs','lw','lf','cf','rf', 'rw','lam','cam','ram', 'lm','lcm','cm','rcm','rm','lwb','ldm', 'cdm','rdm', 'rwb','lb','lcb', 'cb','rcb','rb']
# 4. Refactor the position features based on bullet 1 above to remove the +/- increment/decrement and preserve
# the current year's score
df_players1 = df_players[df_position_attr].apply(lambda x: x.str.slice(start=0,stop=2),
axis=1, result_type='broadcast')
df_players[df_position_attr] = df_players1
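The two-character slice above assumes every base score has exactly two digits. A regex-based alternative, shown here on a toy Series rather than our dataframe, would also handle one- or three-digit bases:

```python
import pandas as pd

ratings = pd.Series(['89+2', '64-1', '9+3', '70'])
# Keep only the leading digits, dropping any +n/-n modifier
base = ratings.str.extract(r'^(\d+)', expand=False).astype(int)
print(base.tolist())  # [89, 64, 9, 70]
```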
The player positional features mentioned above and the skill features ['pace','shooting','passing','dribbling','defending','physic'] have a total of 2036 missing values. We see that all of them belong to players categorized in the GK position (player_positions="GK", goalkeeper). We will set these values to zero.
The following features ['gk_diving','gk_handling','gk_kicking','gk_reflexes','gk_speed','gk_positioning'] are specific to goalkeepers. We see a total of 16242 missing values, which corresponds to the non-goalkeeper (regular) players. We will set these values to zero. There are also duplicate columns representing the same attributes, which we will ignore: ['goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 'goalkeeping_reflexes']
df_players.info(verbose=True, null_counts=True)
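The missingness pattern described above can be sanity-checked with a small test of the kind below, shown on a toy frame that mimics the column layout rather than the real dataset:

```python
import pandas as pd

# Toy frame: goalkeepers lack outfield skills, outfielders lack gk_* skills
df = pd.DataFrame({
    'player_positions': ['GK', 'ST', 'CB', 'GK'],
    'pace':             [None, 70.0, 65.0, None],
    'gk_diving':        [80.0, None, None, 85.0],
})
# Every row missing 'pace' should be a goalkeeper...
assert set(df.loc[df['pace'].isna(), 'player_positions']) == {'GK'}
# ...and every row missing 'gk_diving' should be an outfield player
assert 'GK' not in set(df.loc[df['gk_diving'].isna(), 'player_positions'])
# With that confirmed, zero-filling is safe
df = df.fillna(0)
print(df['pace'].tolist())  # [0.0, 70.0, 65.0, 0.0]
```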
# 5, 6. We will set all the other missing values to 0 as explained in bullets 4 & 5 above
df_players.fillna(0, inplace=True)
We will first convert our continuous features.
# changing the continuous values to be float64
continuous_features = ['height_cm','weight_kg','value_eur', 'wage_eur']
df_players[continuous_features] = df_players[continuous_features].astype(np.float64)
sofifa_id is converted to object as it is a nominal feature.
df_players['sofifa_id'] = df_players['sofifa_id'].astype(object)
We will convert object features that are discrete to integer
#change the position features from object to int64, as they were manipulated to remove the +/- in step 1
df_players[df_position_attr] = df_players[df_position_attr].astype(np.int64)
df_players.info(verbose=True, null_counts=True)
Data types: float64(17), int64(60), object(10)
No missing values
# Our cleaned dataset **copy** is df_players_cleaned
df_players_cleaned = df_players.copy()
df_players_cleaned.shape
#We will now also have **views** for goalkeepers in df_players_gk and regular players in df_players_regular
df_players_gk = df_players_cleaned[df_players_cleaned['player_positions'] =='GK']
df_players_gk.shape
df_players_regular = df_players_cleaned[df_players_cleaned['player_positions'] !='GK']
df_players_regular.shape
We selected some attributes to explore.
# we are grouping different types of attributes into feature groups
player_continuous_features = ['age','height_cm','weight_kg','value_eur', 'wage_eur']
The age range is between 16 and 42, with an average of 25. It is interesting that a player can be as young as 16 and already be a professional soccer player.
The average wage is 9496 euro with a very high standard deviation of 21336 euro. It makes sense that wages vary and superstars earn a lot more, but it is surprising that many professional players are not getting high wages. Also, the maximum weight is 110 kg; we will look into this outlier.
# we are adding more percentiles to the describe
df_players_cleaned[player_continuous_features].describe([.05,.1,.25,.5,.75,.9,.95])
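One simple way to follow up on the 110 kg weight outlier is the 1.5×IQR rule; the sketch below uses toy weights rather than the actual column:

```python
import numpy as np

weights = np.array([70, 75, 72, 80, 78, 74, 110])  # toy weights in kg
q1, q3 = np.percentile(weights, [25, 75])
iqr = q3 - q1
# Values beyond 1.5 IQRs outside the quartiles are flagged as outliers
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]
print(outliers)  # [110]
```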
The other attributes we selected relate to abilities. The mentality, attacking, and movement attributes apply to all positions. The minimum scores are very low and the standard deviations are fairly high, because some positions do not require the skills needed in other positions; for instance, a goalkeeper does not require much attacking skill. This statistics run is important because it shows that position is an important factor in the skill scores, so we can run analyses by position separately.
df_players_cleaned[df_mental_attr].describe([.05,.1,.25,.5,.75,.9,.95])
df_players_cleaned[df_attacking_attr].describe([.05,.1,.25,.5,.75,.9,.95])
df_players_cleaned[df_movement_attr].describe([.05,.1,.25,.5,.75,.9,.95])
We observed that the range of the technical skills for regular players is pretty wide. The averages of the scores are around 60, so we anticipate some outliers with lower scores. For pace, passing, dribbling, and physic, the lowest 5% of players have scores around 40 and the top 5% have scores around 80. For shooting and defending, the standard deviation is higher because these skills are more position specific.
df_players_regular[df_technical_attr].describe([.05,.1,.25,.5,.75,.9,.95])
Except for speed, the average goalkeeper scores are around 65. The speed score is much lower than the other goalkeeper skills because a goalkeeper does not need to run much.
df_players_gk[df_gk_attr].describe([.05,.1,.25,.5,.75,.9,.95])
In the box plot below we see that the median wage for a player at center forward is the highest, and the value distribution for CF is wider too. Left-footed players are paid more on average in some positions, for example RW, RM, and LWB, whereas in CF and RB it is the opposite. We are not sure of the reason.
#Plotting wages distribution on log scale by position
plt.figure(figsize=(20,5))
ax = sns.boxplot(data=df_players, y='wage_eur', x='preferred_position', hue='preferred_foot');
ax.set_yscale('log');
ax.set_title('Wages grouped by preferred position & preferred foot', fontsize=20);
ax.set_xlabel('preferred Position', fontsize=15);
ax.set_ylabel('Wage (€)', fontsize=15);
# plotting values distribution with sns
plt.figure(figsize=(20,5));
ax1 = sns.boxplot(data=df_players, y='value_eur', x='preferred_position', hue='preferred_foot');
ax1.set_yscale('log');
ax1.set_title('Values grouped by preferred position & preferred foot', fontsize=20);
ax1.set_xlabel('Preferred Position', fontsize=15);
ax1.set_ylabel('Value (€)', fontsize=15);
#df_players['wage_eur'].describe()
Goalkeepers and center backs are on average taller than other players.
plt.figure(figsize=(20,5));
sns.set();
ax_height = sns.violinplot(y=df_players['height_cm'], x=df_players['preferred_position'], dodge=True);
ax_height.set_title('Height vs Position', fontsize=20);
ax_height.set_xlabel('Preferred Position', fontsize=15);
ax_height.set_ylabel('Height (cm)', fontsize=15);
Right and left wing players are on average lighter than other players. The reason could be that wing players often need to run back and forth to the goal line.
plt.figure(figsize=(20,5));
sns.set();
ax_weight = sns.violinplot(y=df_players['weight_kg'], x=df_players['preferred_position'], dodge=True);
ax_weight.set_title('Weight vs Position', fontsize=20);
ax_weight.set_xlabel('Preferred Position', fontsize=15);
ax_weight.set_ylabel('Weight (kg)', fontsize=15);
In the plot below, we can see that the gap between a player's potential and his overall rating shrinks as the player ages. By the average age of 28, they merge and become the same. This makes sense, as the potential of younger players is higher; as players age, they converge to their overall rating.
We see a spike in the overall rating at age 41 because there are only a few players at that age, and their few high ratings pull the average up.
df_players_cleaned_p = df_players_cleaned.groupby(['age'])['potential'].mean()
df_players_cleaned_o = df_players_cleaned.groupby(['age'])['overall'].mean()
df_players_cleaned_summary = pd.concat([df_players_cleaned_p, df_players_cleaned_o], axis=1)
ax_summary = df_players_cleaned_summary.plot();
ax_summary.set_ylabel('Rating');
ax_summary.set_xlabel('Age');
ax_summary.set_title('Average Rating by Age');
plt.show();
We use plotly (https://plotly.com/python/choropleth-maps/) to plot the count of players by country.
#Map - to show how many players by country
import warnings
warnings.filterwarnings("ignore")
pdf7 = df_players_cleaned[['nationality']].copy()
pdf7['Number of players'] = 1
#Rename nationalities so pycountry can resolve them (and group the UK home nations)
nationality_map = {
    'Antigua & Barbuda': 'Antigua and Barbuda',
    'Bosnia Herzegovina': 'Bosnia and Herzegovina',
    'Cape Verde': 'Republic of Cabo Verde',
    'Central African Rep.': 'Central African Republic',
    'China PR': 'China',
    'Chinese Taipei': 'Taiwan',
    'DR Congo': 'Congo',
    'Democratic Republic of the Congo': 'Congo',
    'FYR Macedonia': 'Macedonia',
    'Guinea Bissau': 'Guinea-Bissau',
    'Trinidad & Tobago': 'Trinidad and Tobago',
    'São Tomé & Príncipe': 'São Tomé and Príncipe',
    'Ivory Coast': "Côte d'Ivoire",
    'Korea DPR': "Democratic People's Republic of Korea",
    'Korea Republic': 'Republic of Korea',
    'Macau': 'China',
    'Republic of Ireland': 'Ireland',
    'St Kitts Nevis': 'Saint Kitts and Nevis',
    'St Lucia': 'Saint Lucia',
    'England': 'United Kingdom',
    'Northern Ireland': 'United Kingdom',
    'Scotland': 'United Kingdom',
    'Wales': 'United Kingdom',
}
pdf7['nationality'] = pdf7['nationality'].replace(nationality_map)
pdf8 = pdf7.groupby('nationality', as_index=False).sum()
list_countries = pdf8['nationality'].unique().tolist()
d_country_code = {}
for country in list_countries:
    try:
        country_data = pycountry.countries.search_fuzzy(country)
        country_code = country_data[0].alpha_3
        d_country_code.update({country: country_code})
    except LookupError:
        print('We could not add ISO 3 code for ->', country)
        d_country_code.update({country: ' '})
for k, v in d_country_code.items():
    pdf8.loc[(pdf8.nationality == k), 'ISO3'] = v
fig = px.choropleth(data_frame = pdf8,
locations= "ISO3",
color= 'Number of players',
hover_name= "nationality",
#color_continuous_scale = 'Plasma',
color_continuous_scale= ["white","green","blue"],
title = 'Number of players per country',
)
fig.show()
From this map, we can see that the country with the largest number of players is the UK (here we regroup England, Northern Ireland, Scotland, and Wales as the UK). The second is Germany (dark green). The map is interactive: users can click each country to see its ISO code and number of players.
pdf8.sort_values(by=['Number of players'], ascending=False).head(10)
We will take a look at correlations between various player skill attributes by category, and particularly see how they influence a player's value (log scale, in euros). We will run the analysis separately for goalkeepers and regular players.
#prepare the plot pallete
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
# log value
df_reg_copy = df_players_regular.copy()
df_reg_copy['lvalue_eur'] = np.log(df_reg_copy['value_eur'])
We evaluate players' basic overall and technical skills for their relation to value in euros.
#analyse Technical skills of regular Non GK
l=df_technical_attr.append(pd.Series(['overall','lvalue_eur','preferred_foot']))
sns.pairplot(df_reg_copy[l], height=2, hue='preferred_foot');
Below is a heat map of the correlations, which confirms the relationships seen in the previous matrix plot between value_eur, passing, dribbling, and overall.
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df_reg_copy[l].corr(), cmap=cmap, annot=True);
f.tight_layout();
#analyse player attacking
l=df_attacking_attr.append(pd.Series(['lvalue_eur','preferred_foot']))
sns.pairplot(df_reg_copy[l], height=2, hue='preferred_foot');
Below is the heat map to show the correlation between the attacking attributes.
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df_reg_copy[l].corr(), cmap=cmap, annot=True);
f.tight_layout();
Analyzing non-goalkeepers on the defending category skills.
#analyse player defending
l=df_defending_attr.append(pd.Series(['lvalue_eur','preferred_foot']));
sns.pairplot(df_reg_copy[l], height=2,hue='preferred_foot');
Below is a heat map to show the correlation between defending attributes.
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df_reg_copy[l].corr(), cmap=cmap, annot=True);
f.tight_layout();
The matrix plot below shows the goalkeepers' primary skill sets.
#analyse GK skills
df_gk_copy = df_players_gk.copy()
df_gk_copy['lvalue_eur'] = np.log(df_gk_copy['value_eur'])
l=df_gk_attr.append(pd.Series(['lvalue_eur','preferred_foot']))
sns.pairplot(df_gk_copy[l], height=2,hue='preferred_foot');
Below is a heat map showing the correlation between the goalkeeper attributes.
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df_gk_copy[l].corr(), cmap=cmap, annot=True);
f.tight_layout();
We plan to use the attacking, skill, movement, power and defending features to classify players' positions into the following four categories.
We have explored the correlation of each attribute in the previous section. The density plots are generated to observe the distribution of each attribute in different positions.
The score distribution is highly correlated with the position a player plays. We anticipate this is because each position focuses on specific skills, i.e. forwards are good at finishing, midfielders at passing, and defenders at tackling.
In all the density plots, goalkeepers generally have the lowest scores, because the goalkeeper position is specialized in its own skill set rather than the regular outfield features.
Defenders have much higher scores on the defending attributes: marking, standing tackle, and sliding tackle.
Forwards have higher scores in mentality positioning, mentality penalties, and attacking finishing.
Midfielders have higher scores in mentality vision, attacking crossing, and attacking short passing.
In some attributes, midfielders have a similar distribution to forwards, for example skill dribbling, skill ball control, power shot power, power jumping, and power long shots.
In some attributes, defenders, forwards, and midfielders have similar distributions, for example mentality composure, movement acceleration, movement sprint speed, movement reactions, and power stamina.
Midfielders and defenders generally have higher scores in skill long passing than forwards. Forwards don't usually make long passes, so this makes sense.
#Function to plot density plots of the passed dataframe and attributes, grouped by position category
def plot_density(df, attributes):
    rows = -(-len(attributes) // 3)  # ceiling division: 3 plots per row
    fig = plt.figure(figsize=(12, rows * 2.5))
    df_DEF = df.loc[df["preferred_position_cat"] == "DEF"]
    df_MID = df.loc[df["preferred_position_cat"] == "MID"]
    df_FWD = df.loc[df["preferred_position_cat"] == "FWD"]
    df_GK = df.loc[df["preferred_position_cat"] == "GK"]
    for index, plot_vars in enumerate(attributes):
        ax = plt.subplot(rows, 3, index + 1)
        ax = sns.kdeplot(data=df_DEF[plot_vars], label='DEF', ax=ax)
        ax = sns.kdeplot(data=df_MID[plot_vars], label='MID', ax=ax)
        ax = sns.kdeplot(data=df_FWD[plot_vars], label='FWD', ax=ax)
        ax = sns.kdeplot(data=df_GK[plot_vars], label='GK', ax=ax)
        ax.set(xlabel=plot_vars)
    plt.tight_layout()
    plt.show()
l=df_mental_attr
plot_density(df_players_cleaned,l)
l=df_attacking_attr
plot_density(df_players_cleaned,l)
l=df_skill_attr
plot_density(df_players_cleaned,l)
l=df_movement_attr
plot_density(df_players_cleaned,l)
l=df_power_attr
plot_density(df_players_cleaned,l)
We will use all of the above attributes to run our classification model; it is in the Exceptional section.
We plan to use https://fbref.com/en/comps/9/Premier-League-Stats to get more data on games, wins, and goals per club and league, to do more analyses in future labs.
Now we will compare the top-performing and bottom-performing teams in the top three leagues for 2020 to see how much they differ in the main skills of players in regular positions, which we can use to build an evaluation of players' salaries.
We will then use this analysis to come up with a player budget to build a new team of our own and compare it to the top three teams' average skill set.
https://fbref.com/en/comps/20/Bundesliga-Stats
The current top three leagues with their top-ranked and bottom-ranked clubs are as follows.
We used https://python-graph-gallery.com/391-radar-chart-with-several-individuals/ to get the idea for creating the spider plot.
#Common method to plot a comparative spider graph for the two teams passed in.
# df: the data frame with the two team values to compare
# attributes: the corresponding attributes these values belong to, on which the plot will be based
# league: the league name (used as the plot title)
def plot_spider(df, attributes, league):
    categories = list(np.array(attributes))
    #get the two club names
    teams = df['club']
    N = len(categories)
    # Angle of each axis in the plot (we divide the circle by the number of variables)
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]
    # Initialise the spider plot
    fig = plt.figure()
    ax = fig.add_subplot(111, polar=True)
    # Put the first axis on top and go clockwise
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)
    # Draw one axis per variable and add labels
    plt.xticks(angles[:-1], categories)
    # Team 1
    values = df.loc[0, categories].tolist()
    values += values[:1]
    ax.plot(angles, values, linewidth=1, linestyle='solid', label=teams[0])
    ax.fill(angles, values, 'b', alpha=0.1)
    # Team 2
    values = df.loc[1, categories].tolist()
    values += values[:1]
    ax.plot(angles, values, linewidth=1, linestyle='solid', label=teams[1])
    ax.fill(angles, values, 'r', alpha=0.1)
    fig.suptitle(league, fontsize=20)
    plt.subplots_adjust(top=0.85)
    # Add legend
    plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
We compare the top team vs the bottom team in the English Premier League.
#Filter our targeted clubs
df_target_clubs_liverpool = df_players_regular[df_players_regular['club'].isin(['Liverpool','Norwich City',
'FC Bayern München','SC Paderborn 07', 'FC Barcelona','RCD Espanyol'])]
#Use the Technical skill set from metadata file
l=df_technical_attr.append(pd.Series(['club']))
#Get groupwise mean for the teams by club
df_target_clubs_liverpool = df_target_clubs_liverpool[l]
df_target_clubs_liverpool = df_target_clubs_liverpool.groupby('club').mean().reset_index()
#df_target_clubs_liverpool
d = df_target_clubs_liverpool[df_target_clubs_liverpool['club'].isin(['Liverpool','Norwich City'])].reset_index()
plot_spider(d,df_technical_attr, 'English Premier');
Below is a similar plot comparison for the Bundesliga.
d = df_target_clubs_liverpool[df_target_clubs_liverpool['club'].isin(['FC Bayern München','SC Paderborn 07'])].reset_index()
plot_spider(d,df_technical_attr, 'Bundesliga');
Below is a plot comparison for La Liga. The gap between the highest-performing club and the lowest in this league is smaller.
d = df_target_clubs_liverpool[df_target_clubs_liverpool['club'].isin(['FC Barcelona','RCD Espanyol'])].reset_index()
plot_spider(d,df_technical_attr, 'La Liga');
We see that Barcelona edges Liverpool slightly in technical skills, although its league is ranked third.
d = df_target_clubs_liverpool[df_target_clubs_liverpool['club'].isin(['Liverpool','FC Barcelona'])].reset_index()
plot_spider(d,df_technical_attr, 'Liverpool vs Barcelona');
We will first derive our budget estimate from all players in our dataset using mean player wages; for a squad of ten regular players plus a goalkeeper, this comes out to roughly 105K euros per month.
# See the mean salary of all players in dataset
df_players_regular['wage_eur'].mean()
df_players_gk['wage_eur'].mean()
# calculate mean salary for 10 regular players + 1 goalkeeper
budget = 9843*10 + 6726
budget
df_players_regular.info
df_players_regular.boxplot(column='wage_eur', by = 'preferred_position',figsize=(13, 6))
Now we take the top team in each of the three leagues and calculate their mean salary, which comes out to 117K per month. We will also check the overall skill mean and the wage distribution between positions for these teams. The top three teams have 79 players in all.
#Filter out top three clubs
df_topthree_clubs = df_players_regular[df_players_regular['club'].isin(['Liverpool',
'FC Bayern München','FC Barcelona'])].copy()
#Use the Technical skill set from metadata file
l=df_technical_attr.append(pd.Series(['wage_eur','preferred_position','overall']))
#Plot for English Premier League
df_topthree_clubs = df_topthree_clubs[l]
## mean wage of the top three teams: 117037
## mean budget we are allocated: 100000 (991066)
df_topthree_clubs['wage_eur'].mean()
df_topthree_clubs['preferred_position'].value_counts();
df_topthree_clubs.boxplot(column='wage_eur', by = 'preferred_position',figsize=(13, 6));
df_topthree_clubs.overall.mean()
df_topthree_clubs
Now we will take the players making at most 100K euros per month and keep those at or above the 99.8th percentile of overall rating; we end up with 78 players to choose from.
We can then compare the averages and wages of these players against those of the top-league players.
# Budget filter: keep players making at most 100K EUR per month
df_our_selection = df_players_regular[df_players_regular['wage_eur'] <= 100000].copy()
# 99.8th percentile of overall rating within the budget pool
value_998 = np.percentile(df_our_selection.overall, 99.8)
value_998
df_our_selection = df_our_selection[df_our_selection['overall'] >= value_998]
df_our_selection.overall.mean()
df_our_selection.boxplot(column='wage_eur', by = 'preferred_position',figsize=(13, 6))
df_our_selection.shape
We will build the comparative data of "Our Team" and the "Top Three" to plot the spider chart.
l = pd.concat([df_technical_attr, pd.Series(['wage_eur','preferred_position','overall'])])
#Subset our selection to the same columns
df_our_selection = df_our_selection[l]
df_topthree_clubs['club'] = 'Top3 Clubs'
df_our_selection['club'] = 'Our Selection'
frames = [df_topthree_clubs, df_our_selection]
result = pd.concat(frames)
result = result.groupby('club').mean().reset_index()
result
In the plot below, we see that "Our Selection" in blue outperforms the top three teams in the average shooting, passing, dribbling, physic, and overall ratings. In defending and pace, our selection is slightly lower. Comparing wages on a percentage basis, our selection would cost about 50% less.
This is just a theoretical analysis, as we assume we can sign any player in the dataset.
#df_technical_attr
l = pd.concat([df_technical_attr, pd.Series(['overall','wage_eur'])])
#l
result['wage_eur'] =(result.wage_eur/result.wage_eur.sum())*100
plot_spider(result,l, 'Our Selection vs Top three');
result
We will use the features analyzed in the Explore Attributes and Class section to classify players into four categories (GK, FWD, DEF, MID).
df_pred_pos=df_players_cleaned.copy()
## We see some imbalance across the position categories.
position_counts = pd.DataFrame(df_pred_pos['preferred_position_cat'].value_counts())
position_counts['Percentage'] = position_counts['preferred_position_cat'] / position_counts['preferred_position_cat'].sum()
position_counts
#df_pred_pos.info()
This is an imbalanced dataset, with the percentage distribution shown below. We will keep the existing distribution for now.
plt.figure(figsize=(4,4))
plt.pie(position_counts['Percentage'],
        labels = position_counts.index);
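Since we keep the existing distribution for now, one common remedy worth noting for later is `class_weight='balanced'`, which reweights samples inversely to class frequency. Below is a minimal sketch on synthetic labels (not our actual position counts), showing how scikit-learn computes those weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, illustrative labels: 60 midfielders vs only 10 goalkeepers
y = np.array(['MID'] * 60 + ['GK'] * 10)

# 'balanced' weight = n_samples / (n_classes * class_count)
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))  # the rarer GK class gets the larger weight
```

Passing `class_weight='balanced'` to `LogisticRegression` applies these weights automatically during fitting.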
from sklearn.model_selection import ShuffleSplit
# we want to predict the X and y data as follows:
#Combine all attribute groups, plus height and weight
l = pd.concat([df_mental_attr, df_attacking_attr, df_skill_attr,
               df_movement_attr, df_power_attr, df_defending_attr,
               pd.Series(['height_cm', 'weight_kg'])])
#l=l.append(pd.Series(['wage_eur','overall']))
#l
y = df_pred_pos.preferred_position_cat.values # get the labels we want
X = df_pred_pos[l].values # use everything else to predict!
## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
# have converted them into simple matrices to use with scikit learn
# to use the cross validation object in scikit learn, we need to grab an instance
# of the object and set it up. This object will be able to split our data into
# training and testing splits
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,test_size = 0.2)
print(cv_object)
We will run a logistic regression with three-fold cross-validation and check the confusion matrix and the accuracy score for each run.
# run logistic regression and vary some parameters
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import classification_report
lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None, solver='liblinear' ) # get object
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X, y)):
    clf = lr_clf.fit(X[train_indices], y[train_indices]) # train object
    y_hat = lr_clf.predict(X[test_indices]) # get test set predictions

    # print the accuracy and confusion matrix
    print("====Iteration", iter_num, "====")
    print("accuracy", mt.accuracy_score(y[test_indices], y_hat))
    print("confusion matrix\n", mt.confusion_matrix(y[test_indices], y_hat))
    plot_confusion_matrix(clf, X[test_indices], y[test_indices], cmap=plt.cm.Blues, values_format='d')
    plt.grid(False);
    print(classification_report(y[test_indices], y_hat, target_names=['DEF', 'FWD', 'GK', 'MID']))
We were able to get a good F1 score (the harmonic mean of precision and recall) for goalkeepers and defenders. For the forward position the F1 score was reasonable at 0.78, but it was a bit low for midfielders at 0.66. We see some confusion between midfielders and forwards, as their skills and physical attributes overlap.
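The overlap between two classes can be quantified by row-normalizing the confusion matrix, so each row shows where a true class's predictions go. A minimal sketch follows, using synthetic MID/FWD labels (the counts are illustrative, not from our notebook run).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic labels: 10 true midfielders and 10 true forwards
y_true = ['MID'] * 10 + ['FWD'] * 10
y_pred = ['MID'] * 7 + ['FWD'] * 3 + ['FWD'] * 6 + ['MID'] * 4

cm = confusion_matrix(y_true, y_pred, labels=['FWD', 'MID'])
cm_norm = cm / cm.sum(axis=1, keepdims=True)  # each row now sums to 1
print(cm_norm)  # off-diagonal entries show the MID/FWD confusion rate
```

The off-diagonal fractions give the share of each true class misassigned to the other, which is a direct measure of the positional overlap discussed above.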
We print the model weights below for interpretation, although with this many correlated features they are hard to interpret directly.
weights = lr_clf.coef_.T # take transpose to make a column vector
variable_names = l
for coef, name in zip(weights, variable_names):
    print(name, 'has weight of', coef[0])
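One way to make such weights easier to read is to rank them by absolute magnitude, so the most influential features come first. A minimal sketch on synthetic data follows; the feature names are hypothetical, not the lab's columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# The class depends on features 0 and 2; feature 1 is irrelevant noise
y = (X[:, 0] - 2 * X[:, 2] > 0).astype(int)
names = ['feat_a', 'feat_b', 'feat_c']  # illustrative names

clf = LogisticRegression(max_iter=1000).fit(X, y)
order = np.argsort(-np.abs(clf.coef_[0]))  # most influential first
for i in order:
    print(names[i], round(clf.coef_[0][i], 2))
```

Note that for a multiclass model like ours, `coef_` has one row per class, so this ranking would be done per position.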
In this section, we will use Principal Component Analysis (PCA), an unsupervised linear transformation technique for dimensionality reduction, to study the relationships between the four main positions: defenders (DEF), midfielders (MID), forwards (FWD), and goalkeepers (GK). PCA is useful when a dataset has a large number of correlated features, since it allows us to summarize the information with a smaller number of collectively representative variables.
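Before fixing the number of components at two, it can help to check how much variance the leading components capture via `explained_variance_ratio_`. A minimal sketch on synthetic correlated data (not the FIFA columns) follows.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 6 observed features that are noisy linear mixes of 2 latent factors
base = rng.normal(size=(500, 2))
X = base @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # two components capture most of the variance
```

When the first two ratios sum close to 1, a 2-D scatter of the transformed data is a faithful summary; if not, more components may be needed.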
We now create a new dataset with all the skill features.
df_players_pca1 = df_players[['preferred_position_cat','pace', 'shooting', 'passing', 'dribbling',
'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking',
'gk_reflexes', 'gk_speed', 'gk_positioning', 'attacking_crossing',
'attacking_finishing', 'attacking_heading_accuracy',
'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
'movement_agility', 'movement_reactions', 'movement_balance',
'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
'mentality_positioning', 'mentality_vision', 'mentality_penalties',
'mentality_composure', 'defending_marking', 'defending_standing_tackle',
'defending_sliding_tackle']]
position_pca_withGK = df_players_pca1.preferred_position_cat
Let's find the two best dimensions of this dataset and print the PCA components.
from sklearn.decomposition import PCA
df_players_pca2 = df_players_pca1.drop(["preferred_position_cat"], axis=1)  # avoid mutating df_players_pca1 in place
X = df_players_pca2.values
pca = PCA(n_components=2)
X_pca = pca.fit(X).transform(X) # fit data and then transform it
# print the components
print ('pca:', pca.components_)
# this function definition just formats the weights into readable strings
def get_feature_names_from_weights(weights, names):
    tmp_array = []
    for comp in weights:
        tmp_string = ''
        for fidx, f in enumerate(names):
            if fidx > 0 and comp[fidx] >= 0:
                tmp_string += '+'
            tmp_string += '%.2f*%s ' % (comp[fidx], f)  # use the full feature name
        tmp_array.append(tmp_string)
    return tmp_array
feature_names = ['pace', 'shooting', 'passing', 'dribbling',
'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking',
'gk_reflexes', 'gk_speed', 'gk_positioning', 'attacking_crossing',
'attacking_finishing', 'attacking_heading_accuracy',
'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
'movement_agility', 'movement_reactions', 'movement_balance',
'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
'mentality_positioning', 'mentality_vision', 'mentality_penalties',
'mentality_composure', 'defending_marking', 'defending_standing_tackle',
'defending_sliding_tackle']
# now let's get to the Data Analytics!
pca_weight_strings = get_feature_names_from_weights(pca.components_, feature_names)
# create a pandas dataframe from the transformed outputs
df_pca1 = pd.DataFrame(X_pca, columns=pca_weight_strings)
df_pca1['preferred_position_cat']=position_pca_withGK
#sns.regplot(x=df_pca.iloc[:,0], y=df_pca.iloc[:,1])
plt.figure(figsize=(10,10));
ax = sns.scatterplot(x=df_pca1.iloc[:,0], y=df_pca1.iloc[:,1],hue=df_pca1.iloc[:,2]);
ax.set(xlabel="PCA1", ylabel = "PCA2", title = "Principal Component Analysis with all player positions");
Along the x-axis (PCA1), the first component separates goalkeepers (GK) from players in the other positions as two clusters. Along the y-axis (PCA2), the second component divides the larger cluster into three groups (green, red, and blue) for the three outfield positions: defenders (DEF), midfielders (MID), and forwards (FWD). This makes sense because GK is played very differently from the other positions.
From the visualization, we can guess that the first component is dominated by the outfield players rather than the GK position. We will exclude the GK position in the next step and repeat the same study to verify this prediction.
We now create a new dataframe without the GK skill columns, and then drop the GK position.
df_players_pca3 = df_players[['preferred_position_cat','pace', 'shooting', 'passing', 'dribbling',
'defending', 'physic', 'attacking_crossing',
'attacking_finishing', 'attacking_heading_accuracy',
'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
'movement_agility', 'movement_reactions', 'movement_balance',
'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
'mentality_positioning', 'mentality_vision', 'mentality_penalties',
'mentality_composure', 'defending_marking', 'defending_standing_tackle',
'defending_sliding_tackle']]
df_players_pca4 = df_players_pca3[df_players_pca3['preferred_position_cat'] != "GK"].copy()
df_players_pca4 = df_players_pca4.reset_index(drop=True)
position_pca_noGK = df_players_pca4.preferred_position_cat
The two best dimensions of the dataset without the GK position are as follows.
df_players_pca5 = df_players_pca4.drop(["preferred_position_cat"], axis=1)  # avoid mutating df_players_pca4 in place
X = df_players_pca5.values
pca = PCA(n_components=2)
X_pca = pca.fit(X).transform(X) # fit data and then transform it
# print the components
print ('pca:', pca.components_)
feature_names = ['pace', 'shooting', 'passing', 'dribbling',
'defending', 'physic', 'attacking_crossing',
'attacking_finishing', 'attacking_heading_accuracy',
'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
'movement_agility', 'movement_reactions', 'movement_balance',
'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
'mentality_positioning', 'mentality_vision', 'mentality_penalties',
'mentality_composure', 'defending_marking', 'defending_standing_tackle',
'defending_sliding_tackle']
# now let's get to the Data Analytics!
pca_weight_strings = get_feature_names_from_weights(pca.components_, feature_names)
# create some pandas dataframes from the transformed outputs
df_pca2 = pd.DataFrame(X_pca, columns=pca_weight_strings)
df_pca2['preferred_position_cat']=position_pca_noGK
#sns.regplot(x=df_pca.iloc[:,0], y=df_pca.iloc[:,1])
plt.figure(figsize=(10,10));
ax = sns.scatterplot(x=df_pca2.iloc[:,0], y=df_pca2.iloc[:,1],hue=df_pca2.iloc[:,2]);
ax.set(xlabel="PCA1", ylabel = "PCA2", title = "Principal Component Analysis without GK position");
As we predicted, the first component is dominated by the DEF, MID, and FWD positions. With the GK position excluded, the plot above shows the relationships among these three positions: they form a single cluster, but the cluster still divides into three parts representing DEF, MID, and FWD. We see some overlap between MID and FWD.